Beyond Hadoop: Recent Directions in Data Computing for Internet Services

Authors

  • Zhiwei Xu
  • Bo Yan
  • Yongqiang Zou
Abstract

As a main subfield of cloud computing applications, Internet services require large-scale data computing. Their workloads can be divided into two classes: customer-facing, query-processing interactive tasks that serve hundreds of millions of users within a short response time, and backend data-analysis batch tasks that involve petabytes of data. Hadoop, an open source software suite, is used by many Internet services as the main data computing platform. Hadoop is also used by academia as a research platform and an optimization target. This paper presents five research directions for optimizing Hadoop: improving performance, utilization, power efficiency, and availability, and handling different consistency constraints. The survey covers both backend analysis and customer-facing workloads. A total of 15 innovative techniques and systems are analyzed and compared, focusing on their main research issues, innovative techniques, and optimized results.

This paper reviews recent research directions in data computing for Internet services. We identify five inter-related research directions that are of interest to any cloud system with data computing workloads: (1) improving performance (response time, throughput, or job execution time); (2) improving system utilization (CPU, disk, I/O bandwidth, etc.); (3) improving energy efficiency (saving power or energy); (4) improving system availability (availability and reliability); and (5) considering different consistency constraints. The last direction is important not only because consistency affects performance, but also because of the CAP theorem (Brewer, 2000): consistency constraints affect availability and partition tolerance.

DOI: 10.4018/ijcac.2011010104. International Journal of Cloud Applications and Computing, 1(1), 45-61, January-March 2011. Copyright © 2011, IGI Global.
Many Internet services now use Hadoop as their main data computing platform (Hadoop, n.d.). Hadoop is also extensively used by academia as a research platform and an optimization target. Supported by a vibrant community, with increasing contributions from companies and academia, Hadoop is developing rapidly. For instance, Hadoop was originally created to handle backend data analysis jobs, comprising only MapReduce, HDFS, and a common core. Components in the current Hadoop suite (Figure 1), except HBase, are still mainly for backend applications. However, much research is ongoing to extend Hadoop for customer-facing applications. Since Hadoop is an open source software suite organized by the Apache Software Foundation, research contributions are not hindered by proprietary barriers. This paper focuses on research that has been, or can be, converted and integrated into Hadoop.

The rest of the paper is organized as follows: Section 2 discusses optimization techniques for backend data analysis workloads, which improve job execution time, system utilization, and energy efficiency through innovative techniques in scheduling, I/O organization, and node disabling. Section 3 discusses optimization techniques for customer-facing interactive workloads; we survey five systems with different data models and review three optimization techniques for specific data-model operations. Section 4 offers concluding remarks and points out future research problems. We select only techniques and systems that are representative. A total of 15 innovative techniques and systems are analyzed and compared, focusing on their main research issues, innovative techniques, and optimized results.

2. OPTIMIZATIONS ON BACKEND DATA ANALYSIS

This section analyzes seven techniques for optimizing data computing for backend data analysis workloads. The objectives are to improve job execution time, system utilization, and energy efficiency, involving innovative scheduling, I/O organization, and node-disabling techniques.
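The backend batch workloads discussed here follow Hadoop's MapReduce programming model. A minimal sketch of that model in plain Python may make it concrete (this simulates the map/shuffle/reduce phases only; it is not Hadoop's Java API, and all function names are illustrative):

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the user-defined mapper to every input record,
    # yielding intermediate (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    # Group intermediate values by key, as the framework
    # does between the map and reduce stages.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    # Apply the user-defined reducer once per key.
    return {key: reducer(key, values) for key, values in grouped.items()}

# Word count: the canonical backend batch-analysis example.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def word_reducer(word, counts):
    return sum(counts)

lines = ["hadoop runs batch jobs", "batch jobs scan petabytes"]
result = reduce_phase(shuffle_phase(map_phase(lines, word_mapper)),
                      word_reducer)
print(result["batch"])  # 2
```

In a real Hadoop job, the same mapper and reducer logic runs in parallel across many nodes, with HDFS supplying input splits and the framework performing the shuffle; the scheduling and I/O optimizations surveyed below target exactly those framework stages.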
Section 2.1 introduces three Hadoop scheduling optimization techniques. Section 2.2 discusses three enhancements using storage techniques. Section 2.3 reviews a technique to improve energy efficiency. Section 2.4 summarizes these techniques.

2.1. Scheduling Optimizations


Journal title:
  • IJCAC

Volume 1, Issue 1

Pages 45-61

Publication date 2011